#### **SEC204**

# Computer architectures and low level programming

Dr. Vasilios Kelefouras

Email: v.kelefouras@plymouth.ac.uk

https://www.plymouth.ac.uk/staff/vasilioskelefouras

#### Computer Architectures – Last Pieces of the Puzzle

#### Too many puzzling words:

• x86, RISC, CISC, EPIC, VLIW, Harvard architecture

• SIMD

• Microcontrollers, ASIC, ASIP, FPGA, GPU, DSP

• Pipeline, vector processing, superscalar, hyper-threading, multi-threading

• Heterogeneous systems







#### **Outline**

- Different computer architectures classified regarding purpose
- General Purpose Processors
- Application Specific Processors
- Coprocessors / accelerators
- Multi-core processors
- Many-core processors
- Simultaneous Multithreading
- Single Instruction Multiple Data
- Heterogeneous Systems



# Computer architectures – classified regarding purpose (1)



Fig.1. CPU market analysis

# Computer architectures – classified regarding purpose (2)

- 1. General Purpose Processors
- 2. Specific Purpose Processors
- 3. Accelerators, also called co-processors

### General Purpose Processors (GPP)

- They are classified into:
  - 1. **General purpose microprocessors** – general purpose computers, e.g., desktop PCs, laptops
    - Very powerful CPUs
    - Superscalar and Out of Order, big cache memories, lots of pipeline stages
  - 2. **Microcontrollers** – embedded systems
    - Less powerful CPUs, e.g., ARM, Texas Instruments
    - They are usually designed for specific tasks in embedded systems
    - They usually have control oriented peripherals
    - They have on chip CPU, fixed amount of RAM, ROM, I/O ports
    - Lower cost, lower performance, lower power consumption, smaller than microprocessors
    - Appropriate for applications in which cost, power consumption and chip area are critical

### GPP - General Purpose Microprocessor

- General Purpose Microprocessor – general purpose computers
  - They are designed for general purpose computers such as PCs, workstations, laptops, notepads, etc.
  - Higher CPU frequency than microcontrollers
  - Higher cost than microcontrollers
  - Higher performance than microcontrollers
  - Higher power consumption than microcontrollers
  - General purpose processors are designed to execute multiple applications and perform multiple tasks

#### GPP - Microcontrollers



Fig.2. Microcontrollers

Fig.3. Components of the Microcontroller

#### **Application Specific Processors (1)**

- General purpose processors offer good performance for all different applications, but specific purpose processors offer better performance for a specific task
- Application specific processors emerged as a solution for
  - higher performance
  - lower power consumption
  - lower cost
- Application specific processors have become part of our life and can be found in almost every device we use on a daily basis
- Devices such as TVs, mobile phones and GPSs all have application specific processors
- They are classified into
  - 1. Digital Signal Processors (DSPs)
  - 2. Application Specific Instruction Set Processors (ASIPs)
  - 3. Application Specific Integrated Circuits (ASICs)

### Digital Signal Processors (DSPs)

- DSP: Programmable microprocessor for extensive real-time mathematical computations
  - A specialized microprocessor with its architecture optimized for the operational needs of digital signal processing
  - DSP processors are designed specifically to perform large numbers of complex arithmetic calculations as quickly as possible
  - DSPs tend to have a different Arithmetic Unit architecture:
    - specialized hardware units, such as bit reversal, multiple Multiply-Accumulate (MAC) units, etc.
    - Normally DSPs have a small instruction cache but no data cache memory
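
To make the Multiply-Accumulate (MAC) operation concrete, the sketch below writes a small FIR filter as a chain of explicit MAC steps. The FIR filter is only an assumed example workload (it is not named in the slides), and `fir_filter` is a hypothetical helper; a real DSP would perform each MAC in dedicated hardware rather than in a software loop.

```python
def fir_filter(samples, coeffs):
    """Filter `samples` with the taps in `coeffs` using explicit MAC steps."""
    output = []
    for n in range(len(samples)):
        acc = 0  # plays the role of the accumulator register
        for k in range(len(coeffs)):
            if n - k >= 0:
                # One multiply-accumulate: acc = acc + coeff * sample
                acc += coeffs[k] * samples[n - k]
        output.append(acc)
    return output


if __name__ == "__main__":
    # A 2-tap filter with both coefficients equal to 1 sums neighbouring samples.
    print(fir_filter([1, 2, 3, 4], [1, 1]))
```

Every output sample needs one MAC per filter tap, which is why several MAC units working side by side pay off for this kind of code.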

#### Application Specific Instruction set Processor (ASIP)

- 2. **ASIP:** Programmable microprocessor where hardware and instruction set are designed together for one special application
  - Instruction set, micro architecture and/or memory system are customised for an application or family of applications
  - Usually, they are divided into two parts: static logic, which defines a minimum ISA, and configurable logic, which can be used to design new instructions
  - The configurable logic can be programmed to extend the instruction set, similar to FPGAs
  - Better performance, lower cost, and lower power consumption than GPP

## Application Specific Integrated Circuit (ASIC)

#### **ASIC:** Algorithm completely implemented in hardware

- An Integrated Circuit (IC) designed for a specific product line of a company – full custom
- It cannot be modified: it is produced as a single, specific product for a particular application only
- Proprietary by nature and not available to the general public
- ASICs are full custom, therefore they require very high development costs
- An ASIC is built for one and only one customer
- An ASIC is used only in one product line
- Only volume production of ASICs for one product makes sense: the unit cost is low for high-volume products, otherwise the cost is not efficient
- Implementing an ASIC takes a lot of effort; hardware description languages such as VHDL and Verilog are used

Consider that we want to build an application specific system. We can choose:

#### 1. GPP:

- Functionality of the system is exclusively built on the software level
- It is not efficient in terms of performance, power consumption, cost, chip area and heat dissipation

#### 2. ASIC:

- No flexibility and extensibility

#### 3. ASIP:

- a compromise between the two extremes
- used in embedded and system-on-chip solutions




Fig.4. Comparison between Performance and flexibility

# Building an application specific system on an embedded system (2)

Table 1. Comparison between different approaches for Building Embedded Systems [1]

|             | GPP         | ASIP                | ASIC             |
|-------------|-------------|---------------------|------------------|
| Performance | Low         | High                | Very High        |
| Flexibility | Excellent   | Good                | Poor             |
| HW design   | None        | Large               | Very large       |
| SW design   | Small       | Large               | None             |
| Power       | Large       | Medium              | Small            |
| Reuse       | Excellent   | Good                | Poor             |
| Market      | Very large  | Relatively<br>large | Small            |
| Cost        | High        | Medium              | Volume sensitive |

# Accelerators - coprocessors

- Accelerators / co-processors are used to perform some functions more efficiently than the CPU
- They offer
  - Higher performance
  - Lower power consumption
  - High Performance per Watt
  - But they are harder to program

## Field Programmable Gate Arrays (FPGAs)

- FPGAs are devices that allow us to create our own digital circuits
- An FPGA (Field Programmable Gate Array) is an array of logic gates that can be hardware programmed to fulfill user-specified tasks
  - FPGAs contain programmable logic components called "logic blocks", and a hierarchy of reconfigurable interconnects that allow the blocks to be "wired together"
  - An application can be implemented entirely in HW
  - The FPGA configuration is generally specified using a hardware description language (HDL) like VHDL and Verilog – hard to program
  - High Level Synthesis (HLS) provides a solution to this problem. Engineers write C/C++ code instead, but it is not that efficient yet
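
As a rough illustration of "logic blocks" that are configured and then "wired together", the sketch below models a 2-input logic block as a 4-entry lookup table and combines two of them into a half adder. Modelling a block as a lookup table is an assumption made here for illustration, and `LogicBlock`, `XOR_CONFIG`, `AND_CONFIG` and `half_adder` are hypothetical names; this is a software model, not an HDL design.

```python
class LogicBlock:
    """Hypothetical 2-input logic block modelled as a 4-entry lookup table."""

    def __init__(self, truth_table):
        if len(truth_table) != 4:
            raise ValueError("a 2-input block needs exactly 4 table entries")
        # "Programming" the block means choosing the contents of this table.
        self.truth_table = list(truth_table)

    def evaluate(self, a, b):
        # The two input bits select one entry of the table.
        return self.truth_table[(a << 1) | b]


XOR_CONFIG = [0, 1, 1, 0]  # table entries for inputs 00, 01, 10, 11
AND_CONFIG = [0, 0, 0, 1]


def half_adder(a, b):
    """Feed the same two inputs to two blocks: one gives the sum, one the carry."""
    sum_block = LogicBlock(XOR_CONFIG)
    carry_block = LogicBlock(AND_CONFIG)
    return sum_block.evaluate(a, b), carry_block.evaluate(a, b)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, half_adder(a, b))
```

Loading a different table into the same block changes the function it computes, which is the sense in which the hardware is reconfigurable.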

#### FPGAs (2)

FPGAs come on a board. This board is connected to a PC and programmed. Then, it can work as a standalone component.



## FPGAs (3)

- Unlike an ASIC, the circuit design is not set and you can reconfigure an FPGA as many times as you like!
  - Creating an ASIC also costs potentially millions of dollars and takes weeks or months
  - However, the recurring cost of an ASIC is lower than that of the FPGA (no silicon area is wasted in ASICs)
  - ASICs are cheaper only when the production number is very high
- Intel plans hybrid CPU-FPGA chips
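
The cost trade-off between a one-off development cost and a per-unit cost can be sketched as a break-even calculation. All figures below are made-up placeholders, not real prices, and `total_cost` and `break_even_volume` are hypothetical helpers that assume whole-number costs.

```python
def total_cost(one_off_cost, unit_cost, volume):
    """Total cost = one-off development cost + per-unit cost * number of units."""
    return one_off_cost + unit_cost * volume


def break_even_volume(asic_one_off, asic_unit, fpga_one_off, fpga_unit):
    """Smallest volume at which the ASIC is strictly cheaper in total.

    Returns None if the ASIC unit cost is not lower than the FPGA unit cost,
    because then a higher one-off cost can never be recovered.
    Assumes integer costs so the division below is exact integer arithmetic.
    """
    if asic_unit >= fpga_unit:
        return None
    extra_one_off = asic_one_off - fpga_one_off
    saving_per_unit = fpga_unit - asic_unit
    return max(0, extra_one_off // saving_per_unit + 1)


if __name__ == "__main__":
    # Placeholder numbers: large one-off ASIC cost, cheap units; FPGA the opposite.
    volume = break_even_volume(2_000_000, 5, 0, 45)
    print(volume, total_cost(2_000_000, 5, volume), total_cost(0, 45, volume))
```

Below the break-even volume the FPGA is the cheaper choice; above it the lower recurring cost of the ASIC wins.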

#### GPUs (1)

- Graphics Processing Unit (GPU)
  - The GPU's advanced capabilities were originally used primarily for 3D game graphics. But now those capabilities are being harnessed more broadly to accelerate computational workloads in other areas too
  - GPUs are very efficient for
    - Data parallel applications
    - Throughput intensive applications – the algorithm is going to process lots of data elements



#### GPUs (2) – why do we need GPUs?



## GPUs (3)

- A GPU is always connected to a CPU – GPUs are coprocessors
- GPUs work in lower frequencies than CPUs
- GPUs have many processing elements (up to 1000s)
- GPUs have smaller and faster cache memories
- OpenCL is the dominant open general-purpose GPU computing language, and is an open standard
- The dominant proprietary framework is Nvidia CUDA

#### Schematic of Nvidia GPU architecture



- Multiple cores on the same chip using a shared cache
- Typically from 2-8 cores
- Both cores compete for the same hardware resources
- Both cores are identical
- Every core is a superscalar out of order CPU

Figure labels: Processor Core (two), L1 Instruction Cache, L1 Data Cache, L2 Cache, L3 (Last-Level) Cache, Memory

#### Multi-core CPUs – ARM Cortex-A15



#### ARM Cortex-A15



#### Multi-core CPUs - Intel i7 architecture

In the figure below there is the Intel i7 CPU, where four CPU cores and the GPU reside in the same chip




#### Many core Processors – Intel Xeon Phi

- They are intended for use in supercomputers, servers, and high-end workstations
- 57-61 simple in-order cores (simpler than i7 cores)
- 1-1.7 GHz
- 512-bit vector instructions
- each core is connected to a ring interconnect via the Core Ring Interface



# Comparison



#### Performance, Area and Power Efficiency

#### CPU:

- Market-agnostic
- Accessible to many programmers (Python, C++)
- Flexible, portable

#### FPGA:

- Somewhat Restricted Market
- Harder to Program (VHDL, Verilog)
- More efficient than SW
- More expensive than ASIC

#### **ASIC**

- Market-specific
- Fewer programmers
- Rigid, less programmable
- Hard to build (physical)

## Superscalar and Out of Order is not enough (1)

The approach of exploiting ILP through superscalar execution is seriously weakened by the fact that programs normally don't have a lot of fine-grained parallelism in them.

Because of this, CPUs normally don't exceed 3 instructions per cycle when running most mainstream, real-world software, due to a combination of load latencies, cache misses, branching and dependencies between instructions.
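
The effect of dependencies on instructions per cycle can be sketched with a toy scheduler. The model below is a deliberate simplification and not a description of any real CPU: every instruction takes one cycle, there are no cache misses or branches, and `schedule` and `instructions_per_cycle` are hypothetical helpers. Each instruction is described only by the list of earlier instructions it depends on.

```python
def schedule(dependencies, issue_width):
    """Return the cycle in which each instruction issues on an idealized core.

    dependencies[i] lists the indices of earlier instructions whose results
    instruction i needs. An instruction can issue once all of them have
    issued in an earlier cycle (unit latency).
    """
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    for i, deps in enumerate(dependencies):
        if any(d < 0 or d >= i for d in deps):
            raise ValueError("an instruction may only depend on earlier ones")

    issue_cycle = [None] * len(dependencies)
    pending = list(range(len(dependencies)))
    cycle = 0
    while pending:
        issued_now = []
        for i in pending:
            if len(issued_now) == issue_width:
                break
            # Results produced in this same cycle are not visible yet.
            if all(issue_cycle[d] is not None for d in dependencies[i]):
                issued_now.append(i)
        for i in issued_now:
            issue_cycle[i] = cycle
        pending = [i for i in pending if issue_cycle[i] is None]
        cycle += 1
    return issue_cycle


def instructions_per_cycle(dependencies, issue_width):
    cycles = schedule(dependencies, issue_width)
    return len(cycles) / (max(cycles) + 1) if cycles else 0.0


if __name__ == "__main__":
    independent = [[], [], [], []]
    chain = [[], [0], [1], [2]]  # each instruction needs the previous result
    print(instructions_per_cycle(independent, 4))
    print(instructions_per_cycle(chain, 4))
```

Even in this idealized model, a chain of dependent instructions issues one per cycle no matter how wide the core is.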




### Superscalar and Out of Order is not enough (2)

- Issuing many instructions in the same cycle only ever happens for short bursts
- Moreover, the dispatch logic of a 5-issue processor is more than 50% larger than a 4-issue design (chip area), with 6-issue being more than twice as large, 7-issue over 3 times the size, 8-issue more than 4 times larger than 4-issue (for only 2 times the width), and so on

**Exploiting instruction level parallelism is expensive**

## Superscalar and Out of Order is not enough (3)

Very important features that further improve the performance of CPUs are:

- Simultaneous multi-threading (SMT), called Hyper-Threading in Intel processors
- Single Instruction Multiple Data (SIMD) – vectorization

# Simultaneous multi-threading (SMT) as a solution to improve CPU's performance (1)

- SMT is the process of a CPU splitting each of its physical cores into virtual cores
- Normally 2 threads are executed in one physical CPU core
- If additional independent instructions aren't available within the program being executed, there is another potential source of independent instructions: other running programs, or other threads within the same program
- Simultaneous multi-threading (SMT) is a processor design technique which exploits exactly this type of thread-level parallelism
- It fills the empty bubbles in the pipelines with useful instructions, but this time, rather than using instructions from further down in the same code, the instructions come from multiple threads running at the same time, all on the one processor core
- So, an SMT processor appears to the rest of the system as if it were multiple independent processors, just like a true multi-processor system
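
The idea of filling empty issue slots from a second thread can be sketched with a toy slot-counting model. It is a strong simplification: each thread is given as a list holding the number of independent instructions it could issue in each cycle, instructions that do not fit are not carried over to the next cycle, and `slots_used` and `smt_slots_used` are hypothetical helpers.

```python
def slots_used(ready_per_cycle, issue_width):
    """Issue slots filled when a single thread runs alone on the core."""
    return sum(min(ready, issue_width) for ready in ready_per_cycle)


def smt_slots_used(thread_a, thread_b, issue_width):
    """Issue slots filled when thread B may use the slots thread A leaves empty."""
    used = 0
    for ready_a, ready_b in zip(thread_a, thread_b):
        from_a = min(ready_a, issue_width)
        from_b = min(ready_b, issue_width - from_a)  # only the leftover slots
        used += from_a + from_b
    return used


if __name__ == "__main__":
    width = 4
    thread_a = [2, 0, 1, 4]  # e.g. a stall in the second cycle
    thread_b = [1, 3, 0, 2]
    total = width * len(thread_a)
    print(slots_used(thread_a, width), "of", total, "slots with one thread")
    print(smt_slots_used(thread_a, thread_b, width), "of", total, "slots with SMT")
```

In this model the second thread never slows the first one down; on a real core both threads also share the caches and functional units, so the gain is smaller and can even be negative.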

# Simultaneous multi-threading (SMT) as a solution to improve CPU's performance (2)

- From a hardware point of view, implementing SMT requires duplicating all of the parts of the processor which store the "execution state" of each thread
  - These parts only constitute a tiny fraction of the overall processor's hardware
  - The really large and complex parts, such as the decoders and dispatch logic, the functional units, and the caches, are all shared between the threads
- On top of this, the fact that the threads in an SMT design are all sharing just one processor core, and just one set of caches, has major performance downsides compared to a true multi-processor (or multi-core)
- SMT performance can actually be worse than single-thread performance
- Speedups from SMT on the Pentium 4 ranged from around -10% to +30% depending on the application(s)

# Single Instruction Multiple Data (SIMD) – Vectorization (1)

- In addition to instruction-level parallelism, there is yet another source of parallelism – data parallelism
- Rather than looking for ways to execute groups of instructions in parallel, the idea is to look for ways to make one instruction apply to a group of data values in parallel
- This is sometimes called SIMD parallelism (single instruction multiple data). More often, it's called vector processing
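
The difference between scalar and vector execution can be sketched by counting "instructions" in a software model. This is only an illustration in plain Python, not real SIMD code: `scalar_add` and `simd_add` are hypothetical helpers, both input lists are assumed to have the same length, and `lanes` stands for the number of data elements one vector instruction handles.

```python
def scalar_add(a, b):
    """One add instruction per pair of elements."""
    result = []
    instructions = 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def simd_add(a, b, lanes):
    """One (modelled) vector add instruction per group of `lanes` elements."""
    result = []
    instructions = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        # All lanes of the chunk are processed by a single instruction.
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return result, instructions


if __name__ == "__main__":
    a = list(range(8))
    b = [10] * 8
    print(scalar_add(a, b))
    print(simd_add(a, b, lanes=4))
```

Both versions produce the same result; with 4 lanes the vector version needs a quarter of the add instructions for 8 elements.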



# Single Instruction Multiple Data (SIMD) – Vectorization (2)



### Single Instruction Multiple Data (SIMD) – Vectorization (3)

- There is specific hardware (HW) supporting a variety of vector instructions as well as wide registers
  - General Purpose Microprocessors
    - Laptops, desktops
    - From 64-bit up to 512-bit vector instructions – all kinds of instructions are supported, e.g., load/store, multiply, if-conditions
  - Microprocessors for embedded systems or Microcontrollers
    - From 32-bit up to 128-bit vector instructions
    - Limited instruction set for Microcontrollers, but not for microprocessors



## Single Instruction Multiple Data (SIMD) – Vectorization (4)

- Modern compilers use auto-vectorization – the compiler does this for us
- For applications where this type of data parallelism is available and easy to extract, SIMD vector instructions can produce amazing speedups
- Unfortunately, it's quite difficult for a compiler to automatically make use of vector instructions
  - hand written code is more efficient
  - The key problem is that the way programmers write programs tends to serialize everything, which makes it difficult for a compiler to prove two given operations are independent and can be done in parallel.
- Rewriting just a small amount of code in key places has a widespread effect across many applications
- Almost every CPU has now added SIMD vector extensions

## More cores. More Threads. Wider vectors








|                     | Intel® Xeon Phi™ coprocessor (Knights Corner) | Intel® Xeon Phi™ processor & coprocessor (Knights Landing¹) |
|---------------------|-----------------------------------------------|-------------------------------------------------------------|
| Cores               | 61                                            | 70+                                                         |
| Threads             | 244                                           | 280+                                                        |
| Vector width (bits) | 512                                           | 512                                                         |

#### Hardware Trends

## From single core processors to heterogeneous systems on a chip




H. Esmaeilzadeh et al., "Dark silicon and the end of multicore scaling", International Symposium on Computer Architecture (ISCA). ACM, 2011.
M. Zahran, "Heterogeneous Computing: Here to Stay". ACM Queue, Nov/Dec 2016.

Unrestricted © Siemens AG 2017

### The CPU frequency has ceased to grow



# Moore's Law Is STILL Going Strong

Hardware performance potential continues to grow

"We think we can continue Moore's Law for at least another 10 years."

Intel Senior Fellow Mark Bohr, 2015



### Hardware Evolution

- Scalar Processors
- Pipelined Processors
- Superscalar Processors
- Out of order Processors
- Vectorization
- Heterogeneous systems

**Time** 

### Heterogeneous computing (1)

Single core Era -> Multi-core Era -> Heterogeneous Systems Era

- Heterogeneous computing refers to systems that use more than one kind of processors or cores
  - These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar (co)-processors, usually incorporating specialized processing capabilities to handle particular tasks
  - Systems with General Purpose Processors (GPPs), GPUs, DSPs, ASIPs etc.
- Heterogeneous systems offer the opportunity to significantly increase system performance and reduce system power consumption

## Heterogeneous computing (2)

- Software issues:
  - Offloading
  - Programmability – think about CPU code (C code), GPU code (CUDA), FPGA code (VHDL)
  - Portability – what happens if your code runs on a machine with an FPGA instead of a GPU?
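
One common way to approach the portability issue is to keep a plain CPU version of every routine and select an accelerator version only when one is available. The sketch below is a hypothetical illustration of that idea in plain Python: the backend names, `BACKENDS`, `select_backend` and `run_portable` are all made up for this example, and no real GPU or FPGA API is called.

```python
def vector_scale_cpu(values, factor):
    """Reference implementation that runs on any machine."""
    return [factor * v for v in values]


# Hypothetical registry: accelerator versions would be added here when present.
BACKENDS = {"cpu": vector_scale_cpu}


def select_backend(preferred, available):
    """Pick the first preferred backend that exists, else fall back to the CPU."""
    for name in preferred:
        if name in available:
            return name
    return "cpu"


def run_portable(values, factor, preferred, backends=BACKENDS):
    name = select_backend(preferred, backends)
    return name, backends[name](values, factor)


if __name__ == "__main__":
    # No "gpu" or "fpga" entry is registered here, so the CPU version runs.
    print(run_portable([1, 2, 3], 2, preferred=["gpu", "fpga"]))
```

The cost of this design is that each routine may have to be written and maintained once per kind of device, which is the programmability issue listed above.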



## Heterogeneous computing (3) – A mobile phone system



#### Think-Pair-Share Exercise

- What is, in your opinion, the most appropriate computer architecture for a smart phone and why?
  - a. 1 microcontroller
  - b. 1 normal speed GPP, e.g., Pentium II
  - c. 1 quad-core processor
  - d. A heterogeneous computer architecture with 1 normal speed GPP, 1 DSP, 1 GPU and a few Microcontrollers

#### Conclusions

- Modern Computer Systems include Parallel Heterogeneous Computer Architectures
- Heterogeneous systems offer the opportunity to significantly
  - increase performance
  - reduce power consumption
  - reduce cost
- Issues:
  - Programmability
  - Portability
  - Designing good compilers to optimize the code

## References and Further Reading

- [1] Nohl, A, Schirrmeister, F & Taussig, D. "Application specific processor design: Architectures, design methods and tools" Computer-Aided Design (ICCAD), 2010 IEEE/ACM International Conference on Nov. 2010.
- [2] Tom Spyrou Challens ignessignes and Throje of Strange Light ERA, TAU 2015
- [3] Modern Microprocessors: A 90-Minute Guide!, available at http://www.lighterra.com/papers/modernmicroprocessors/
- [4] Introduction to GPU computing, available at http://www.int.washington.edu/PROGRAMS/12-2c/week3/clark 01.pdf
- [5] Yousef Qasim, P Radyumna Janga, Sharath Kumar, Hani Alesaimi, APPLICATION SPECIFIC PROCESSORS, ECE/CS 570 PROJECT FINAL REPORT, available at http://web.engr.oregonstate.edu/~qassimy/index_files/Final_ECE570_ASP_2012_Project_Report.pdf
- [6] William Stallings, Computer Organization & Architecture. Designing for Performance, Seventh Edition
- [7] Andrew S. Tanenbaum, Todd Austin, Structured Computer Organization. Sixth Edition, PEARSON

Thank you

Date 04/11/2019

School of Computing (University of Plymouth)